Goals

If we are successful, you should be able to hit the ground running on your own project with R

Setup

Install R from CRAN

Install RStudio from RStudio

Install the tidyverse, lubridate, and ggmap packages

install.packages(c("tidyverse", "lubridate", "ggmap")) #wrap the package names in c() so all three are installed
#you will see activity in the console as the packages are installed

Create a folder called “R workshop”

Download the 311 data from the WPRDC

Move that CSV into the “R workshop” folder

What is R

R is an interpreted programming language for statistics

RStudio

Integrated Development Environment for R

  • Script
  • Console
  • Environment (History, Connections, Git)
  • Files (Plots, Packages, Help, Viewer)

How Does R Work?

Basic Functions

  • arithmetic (add, subtract, multiply, divide)
  • strings
1
## [1] 1
1 + 2
## [1] 3
10 / 2
## [1] 5
5 * 2
## [1] 10
"this is a string. strings in R are surrounded by quotation marks."
## [1] "this is a string. strings in R are surrounded by quotation marks."

Type matters

"1" + 1
## Error in "1" + 1: non-numeric argument to binary operator
str(1)
##  num 1
str("1")
##  chr "1"
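If a number arrives as a string (for example, typed into a text field), it can be converted before doing arithmetic. A minimal sketch using base R's as.numeric() and class(); the object names are made up for illustration:

```r
text_number <- "1"                       #a string that happens to contain a digit
real_number <- as.numeric(text_number)   #convert the string to a number
class(text_number)                       #check the type before converting
class(real_number)                       #check the type after converting
result <- real_number + 1                #the addition works on the converted value
result
```

class() is a shorter alternative to str() when you only want to know the type of an object.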

Objects, Functions, and Assignment

Reminder that objects are shown in the Environment panel

x
## Error in eval(expr, envir, enclos): object 'x' not found
x <- 1
x
## [1] 1

You can overwrite (or update) an object

x <- 2
x
## [1] 2
x <- 1
y <- 5

x + y
## [1] 6

c() means “concatenate”. It creates vectors

a <- c(x, y)
a
## [1] 1 5
z <- sum(a)
z
## [1] 6
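Most R functions and operators work on a whole vector at once. A short sketch with made-up numbers (the object names are only for illustration):

```r
requests <- c(119, 101, 109)   #made-up numbers
doubled <- requests * 2        #arithmetic is applied to every element
first_two <- requests[1:2]     #square brackets pick elements by position. R counts from 1
n_values <- length(requests)   #how many elements are in the vector
average <- mean(requests)      #one summary value for the whole vector
```

This is why there is rarely a need to write a loop for simple calculations in R.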

Dataframes

my_df <- data.frame(a = 1:5,
                b = 6:10,
                c = c("a", "b", "c", "d", "e"))
my_df
##   a  b c
## 1 1  6 a
## 2 2  7 b
## 3 3  8 c
## 4 4  9 d
## 5 5 10 e

Select individual columns in a dataframe with the $ operator

my_df$a
## [1] 1 2 3 4 5
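The $ operator can also create a new column. A small sketch that rebuilds the same example dataframe and adds a column to it; the column name "total" is an assumption chosen for illustration:

```r
my_df <- data.frame(a = 1:5,
                    b = 6:10,
                    c = c("a", "b", "c", "d", "e"))

my_df$total <- my_df$a + my_df$b   #new column built from two existing columns
n_rows <- nrow(my_df)              #number of rows
column_names <- names(my_df)       #names of the columns
column_b <- my_df[["b"]]           #double brackets return the same vector as my_df$b
```

Later sections do the same job with mutate(), which is usually easier to read in longer workflows.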

“<-” and “=” both assign values at the top level of a script. To minimize confusion, many people use “<-” to create objects and “=” to set arguments inside function calls such as data.frame()

x <- 1

a <- data.frame(a = 1:5,
                b = 6:10)
a
##   a  b
## 1 1  6
## 2 2  7
## 3 3  8
## 4 4  9
## 5 5 10

Logic

“x == y” means “is x equal to y?”

1 == 2
## [1] FALSE

“!” means “not”

!FALSE
## [1] TRUE

In arithmetic, TRUE counts as 1 and FALSE counts as 0

TRUE + FALSE
## [1] 1
TRUE + TRUE
## [1] 2

R is case-sensitive

"a" == "A"
## [1] FALSE
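These rules combine in a useful way: comparing a vector to a value returns one TRUE or FALSE per element, and sum() then counts the TRUEs. A sketch with made-up values:

```r
request_type <- c("Potholes", "Rodent control", "Potholes", "potholes")   #made-up values

is_pothole <- request_type == "Potholes"   #one TRUE or FALSE per element
n_potholes <- sum(is_pothole)              #TRUE counts as 1, so this counts the matches
share_potholes <- mean(is_pothole)         #the proportion of matches

#the lowercase "potholes" was missed above because R is case-sensitive
is_pothole_any_case <- tolower(request_type) == "potholes"
```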

Loading packages

library(package_name)

You have to load your packages each time you start R
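Two related tricks, sketched here with the stats package because it ships with every R installation: calling a single function without loading its package, and checking whether a package is installed.

```r
#call one function without loading the whole package: package::function()
middle <- stats::median(c(1, 5, 9))

#check whether a package is installed before trying to load it
has_stats <- requireNamespace("stats", quietly = TRUE)

#list everything that is currently loaded with library()
loaded_packages <- search()
```

The same pattern should work for any installed package, for example lubridate::ymd().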

Commenting

#start a line of code with a "#" to make that line a comment
#1 + 1
#code that is "commented out" will not be executed

Getting help with R

Use the built-in documentation. Put a “?” before the name of a function to access the documentation in the Help panel

?mean
## starting httpd help server ... done

StackOverflow

Working Directory

How to set up the working directory

getwd()
## [1] "C:/Users/conor/githubfolder/pittsburgh_311/code_for_pittsburgh_presentation"

Session menu -> Set working directory -> choose your folder

setwd("path/to/your/folder") #placeholder. replace it with the path to your own folder
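Before reading a file, it can help to confirm that R sees it in the working directory. A minimal sketch in base R; "your_file_name_here.csv" is a placeholder, not a real file:

```r
current_folder <- getwd()                                          #where R is looking right now
csv_path <- file.path(current_folder, "your_file_name_here.csv")   #build the full path to the file
csv_found <- file.exists(csv_path)                                 #TRUE only if the file is in that folder
csv_files <- list.files(current_folder, pattern = "\\.csv$")       #all CSV files in the folder
```

If csv_found is FALSE, either the file is in a different folder or the working directory is not the one you expect.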

Compare to Excel

R separates the data from the analysis. The data is stored in files (CSV, JSON, etc). The analysis is stored in scripts. This makes it easier to share analysis performed in R. No need to take screenshots of your workflow in Excel. You have a record of everything that was done to the data.

Compare to other programming languages

What is the Tidyverse?

A group of R packages that use a common grammar for wrangling, analyzing, modeling, and graphing data

Tidyverse functions

  • select columns
  • filter rows
  • mutate new columns
  • group_by and summarize rows
  • ggplot2 your data

read_csv() reads CSV files from your working directory

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1.9000     v purrr   0.2.4     
## v tibble  1.4.2          v dplyr   0.7.4     
## v tidyr   0.8.0          v stringr 1.2.0     
## v readr   1.1.1          v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
#df <- read_csv("your_file_name_here.csv")

df <- read_csv("https://raw.githubusercontent.com/conorotompkins/pittsburgh_311/master/data/pittsburgh_311_2018_04_10.csv")
## Parsed with column specification:
## cols(
##   `_id` = col_integer(),
##   REQUEST_ID = col_integer(),
##   CREATED_ON = col_datetime(format = ""),
##   REQUEST_TYPE = col_character(),
##   REQUEST_ORIGIN = col_character(),
##   STATUS = col_integer(),
##   DEPARTMENT = col_character(),
##   NEIGHBORHOOD = col_character(),
##   COUNCIL_DISTRICT = col_integer(),
##   WARD = col_integer(),
##   TRACT = col_double(),
##   PUBLIC_WORKS_DIVISION = col_integer(),
##   PLI_DIVISION = col_integer(),
##   POLICE_ZONE = col_integer(),
##   FIRE_ZONE = col_character(),
##   X = col_double(),
##   Y = col_double(),
##   GEO_ACCURACY = col_character()
## )
colnames(df) <- tolower(colnames(df)) #make all the column names lowercase. this is a personal preference

#initial data munging to get the dates in shape
df %>%
  mutate(date = ymd(str_sub(created_on, 1, 10)),
         time = hms(str_sub(created_on, 12, 19)), #characters 12 to 19 hold the HH:MM:SS part
         month = month(date, label = TRUE), 
         year = year(date),
         yday = yday(date)) -> df
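To see what the lubridate functions do, it can help to try them on a single value first. A sketch using one timestamp written in the same format as the created_on column; ymd_hms() and as_date() are lubridate functions that the workshop code itself does not use:

```r
library(lubridate)

created_on <- "2016-03-10 13:52:00"   #one timestamp in the same format as the data
stamp <- ymd_hms(created_on)          #parse the date and the time in one step
day <- as_date(stamp)                 #drop the time, keep the date
day_of_year <- yday(day)              #1 for January 1st, and so on
month_number <- month(day)            #add label = TRUE to get the month name instead
clock_hour <- hour(stamp)             #the hour of the day
```

Parsing the whole timestamp with ymd_hms() is an alternative to cutting the string apart with str_sub().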

Explore the data

df #simply type the name of the object to preview it
## # A tibble: 225,189 x 23
##     `_id` request_id created_on          request_type       request_origin
##     <int>      <int> <dttm>              <chr>              <chr>         
##  1 154245      54111 2016-03-10 13:52:00 Rodent control     Call Center   
##  2 154246      53833 2016-03-09 14:22:00 Rodent control     Call Center   
##  3 154247      52574 2016-03-03 07:13:00 Potholes           Call Center   
##  4 154248      54293 2016-03-11 10:12:00 Building Without ~ Control Panel 
##  5 154249      53560 2016-03-08 14:57:00 Potholes           Call Center   
##  6 154250      49519 2016-02-22 09:10:00 Potholes           Call Center   
##  7 154251      49484 2016-02-22 08:03:00 Potholes           Call Center   
##  8 154252      53787 2016-03-09 12:21:00 Rodent control     Call Center   
##  9 154253      52887 2016-03-04 12:49:00 Potholes           Call Center   
## 10 154254      53599 2016-03-08 16:03:00 Rodent control     Call Center   
## # ... with 225,179 more rows, and 18 more variables: status <int>,
## #   department <chr>, neighborhood <chr>, council_district <int>,
## #   ward <int>, tract <dbl>, public_works_division <int>,
## #   pli_division <int>, police_zone <int>, fire_zone <chr>, x <dbl>,
## #   y <dbl>, geo_accuracy <chr>, date <date>, time <S4: Period>,
## #   month <ord>, year <dbl>, yday <dbl>
glimpse(df) #get a summary of the dataframe
## Observations: 225,189
## Variables: 23
## $ `_id`                 <int> 154245, 154246, 154247, 154248, 154249, ...
## $ request_id            <int> 54111, 53833, 52574, 54293, 53560, 49519...
## $ created_on            <dttm> 2016-03-10 13:52:00, 2016-03-09 14:22:0...
## $ request_type          <chr> "Rodent control", "Rodent control", "Pot...
## $ request_origin        <chr> "Call Center", "Call Center", "Call Cent...
## $ status                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ department            <chr> "Animal Care & Control", "Animal Care & ...
## $ neighborhood          <chr> "Middle Hill", "Squirrel Hill North", "L...
## $ council_district      <int> 6, 8, 9, NA, 9, 9, 9, 3, 9, 1, 4, 4, 9, ...
## $ ward                  <int> 5, 14, 12, NA, 13, 13, 13, 16, 13, 23, 1...
## $ tract                 <dbl> 42003050100, 42003140300, 42003120800, N...
## $ public_works_division <int> 3, 3, 2, NA, 2, 2, 2, 4, 2, 1, 4, 4, 2, ...
## $ pli_division          <int> 5, 14, 12, NA, 13, 13, 13, 16, 13, 23, 1...
## $ police_zone           <int> 2, 4, 5, NA, 5, 5, 5, 3, 5, 1, 6, 3, 5, ...
## $ fire_zone             <chr> "2-1", "2-18", "3-12", NA, "3-17", "3-17...
## $ x                     <dbl> -79.97765, -79.92450, -79.91455, NA, -79...
## $ y                     <dbl> 40.44579, 40.43986, 40.46527, NA, 40.459...
## $ geo_accuracy          <chr> "APPROXIMATE", "APPROXIMATE", "EXACT", "...
## $ date                  <date> 2016-03-10, 2016-03-09, 2016-03-03, 201...
## $ time                  <S4: Period> 13H 52M 0S, 14H 22M 0S, 7H 13M 0S...
## $ month                 <ord> Mar, Mar, Mar, Mar, Mar, Feb, Feb, Mar, ...
## $ year                  <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016...
## $ yday                  <dbl> 70, 69, 63, 71, 68, 53, 53, 69, 64, 68, ...

%>% means “and then”

%>% passes the dataframe on its left to the next function, as that function's first argument
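A sketch of the same calculation written twice, once nested and once with the pipe. The toy dataframe is made up for illustration:

```r
library(dplyr)

toy <- tibble(request_type = c("Potholes", "Weeds", "Potholes"),
              ward = c(5, 14, 12))   #made-up rows

#nested: read from the inside out
nested <- nrow(filter(toy, request_type == "Potholes"))

#piped: read from top to bottom
piped <- toy %>%
  filter(request_type == "Potholes") %>%
  nrow()
```

Both versions return the same number. The piped version tends to stay readable as more steps are added.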

select

df %>% #select the dataframe
  select(date, request_type) #select the date and request_type columns
## # A tibble: 225,189 x 2
##    date       request_type             
##    <date>     <chr>                    
##  1 2016-03-10 Rodent control           
##  2 2016-03-09 Rodent control           
##  3 2016-03-03 Potholes                 
##  4 2016-03-11 Building Without a Permit
##  5 2016-03-08 Potholes                 
##  6 2016-02-22 Potholes                 
##  7 2016-02-22 Potholes                 
##  8 2016-03-09 Rodent control           
##  9 2016-03-04 Potholes                 
## 10 2016-03-08 Rodent control           
## # ... with 225,179 more rows
df %>% 
  select(date, request_type) %>% 
  filter(request_type == "Potholes") #use the string "Potholes" to filter the dataframe
## # A tibble: 31,735 x 2
##    date       request_type
##    <date>     <chr>       
##  1 2016-03-03 Potholes    
##  2 2016-03-08 Potholes    
##  3 2016-02-22 Potholes    
##  4 2016-02-22 Potholes    
##  5 2016-03-04 Potholes    
##  6 2016-03-11 Potholes    
##  7 2016-03-08 Potholes    
##  8 2016-03-08 Potholes    
##  9 2016-03-08 Potholes    
## 10 2016-03-08 Potholes    
## # ... with 31,725 more rows

mutate

df %>% 
  select(date, request_type) %>% 
  filter(request_type == "Potholes") %>% 
  mutate(weekday = wday(date, label = TRUE))
## # A tibble: 31,735 x 3
##    date       request_type weekday
##    <date>     <chr>        <ord>  
##  1 2016-03-03 Potholes     Thu    
##  2 2016-03-08 Potholes     Tue    
##  3 2016-02-22 Potholes     Mon    
##  4 2016-02-22 Potholes     Mon    
##  5 2016-03-04 Potholes     Fri    
##  6 2016-03-11 Potholes     Fri    
##  7 2016-03-08 Potholes     Tue    
##  8 2016-03-08 Potholes     Tue    
##  9 2016-03-08 Potholes     Tue    
## 10 2016-03-08 Potholes     Tue    
## # ... with 31,725 more rows

group_by and summarize

(df %>% 
  select(date, request_type) %>% #select columns
  filter(request_type == "Potholes") %>% #filter by "Potholes"
  mutate(month = month(date, label = TRUE)) %>% #add month column
  group_by(request_type, month) %>% #group by the unique request_type values and month values
  summarize(count = n()) %>% #summarize to count the number of rows in each combination of request_type and month
  arrange(desc(count)) -> df_potholes_month) #arrange the rows by the number of requests
## # A tibble: 12 x 3
## # Groups:   request_type [1]
##    request_type month count
##    <chr>        <ord> <int>
##  1 Potholes     Feb    5569
##  2 Potholes     Mar    3961
##  3 Potholes     Apr    3873
##  4 Potholes     May    3388
##  5 Potholes     Jan    3089
##  6 Potholes     Jun    2896
##  7 Potholes     Jul    2688
##  8 Potholes     Aug    1913
##  9 Potholes     Nov    1344
## 10 Potholes     Sep    1260
## 11 Potholes     Oct    1113
## 12 Potholes     Dec     641

Put your code in parentheses to execute it AND print the output in the console
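The group_by and summarize pattern is easier to check by hand on a tiny dataframe. A sketch with made-up rows; count() is a dplyr shortcut for the same operation, and its name argument is assumed to be available (it is in recent dplyr versions):

```r
library(dplyr)

toy <- tibble(request_type = c("Potholes", "Potholes", "Weeds", "Potholes", "Weeds"),
              month = c("Feb", "Feb", "Feb", "Mar", "Mar"))   #made-up rows

by_hand <- toy %>%
  group_by(request_type, month) %>%   #one group per combination of request_type and month
  summarize(count = n()) %>%          #count the rows in each group
  ungroup()                           #drop the grouping when you are done

with_count <- toy %>%
  count(request_type, month, name = "count")   #same result in one step
```

Both objects have four rows, one per combination that appears in the data.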

Making graphs with 311 data

ggplot2

  • aesthetics (the columns you want to graph with)
  • geoms (the shapes you want to use to graph the data)
ggplot(data = _ , aes(x = _, y = _)) +
  geom_()

Pipe your data directly into ggplot2

some_dataframe %>% 
  ggplot(aes(x = _, y = _)) + #the pipe supplies the data, so no data argument is needed
  geom_()

Graph the number of pothole requests per month

df_potholes_month %>% 
  ggplot(aes(x = month, y = count)) + #put the month column on the x axis, count on the y axis
  geom_col() #graph the data with columns

Make it pretty. Add a title, axis labels, a caption, and a theme

df_potholes_month %>% 
  ggplot(aes(month, count)) +
  geom_col() + 
  labs(title = "Pothole requests to Pittsburgh 311",
       x = "",
       y = "Number of requests",
       caption = "Source: Western Pennsylvania Regional Datacenter") +
  theme_bw()
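A plot can also be stored in an object and built up in steps, which makes it easier to see what each layer adds. A sketch with made-up numbers; the object names are only for illustration:

```r
library(ggplot2)

toy <- data.frame(month = c("Jan", "Feb", "Mar"),
                  count = c(3, 5, 4))   #made-up numbers

base_plot <- ggplot(toy, aes(x = month, y = count)) +
  geom_col()                            #the bare plot

labelled_plot <- base_plot +            #add layers to the stored plot
  labs(title = "A toy example",
       y = "Number of requests") +
  theme_bw()
```

Type the name of the object (for example labelled_plot) to draw it in the Plots panel.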

Make a line graph of the number of pothole requests in the dataset by date

df %>% 
  filter(request_type == "Potholes") %>% 
  count(date) #group_by and summarize the number of rows per date
## # A tibble: 983 x 2
##    date           n
##    <date>     <int>
##  1 2015-04-20   119
##  2 2015-04-21   101
##  3 2015-04-22   109
##  4 2015-04-23   102
##  5 2015-04-24    84
##  6 2015-04-27    85
##  7 2015-04-28   101
##  8 2015-04-29   107
##  9 2015-04-30    83
## 10 2015-05-01    66
## # ... with 973 more rows
#assign labels to objects to save some typing
my_title <- "Pothole requests to Pittsburgh 311"
my_caption <- "Source: Western Pennsylvania Regional Datacenter"

df %>% 
  filter(request_type == "Potholes") %>% 
  count(date) %>% 
  ggplot(aes(date, n)) +
  geom_line() + #use a line to graph the data
  labs(title = my_title, #use the object you created earlier
       x = "",
       y = "Number of requests",
       caption = my_caption) + #use the object you created earlier
  theme_bw(base_size = 18) #base_size sets the size of the font. base_family expects the name of a font

Note that ggplot2 automatically formats the axis labels for dates

Graph the data by number of requests per day of the year

(df %>% 
  select(request_type, date) %>% 
  filter(request_type == "Potholes") %>% 
  mutate(year = year(date), #create a year column
         yday = yday(date)) %>% #create a day of the year column
  count(year, yday) -> df_day_of_year)  #shortcut for group_by + summarize for counting. returns "n"
## # A tibble: 983 x 3
##     year  yday     n
##    <dbl> <dbl> <int>
##  1  2015   110   119
##  2  2015   111   101
##  3  2015   112   109
##  4  2015   113   102
##  5  2015   114    84
##  6  2015   117    85
##  7  2015   118   101
##  8  2015   119   107
##  9  2015   120    83
## 10  2015   121    66
## # ... with 973 more rows
df_day_of_year %>% 
  ggplot(aes(yday, n, group = year)) + #group = year draws a separate line for each year
  geom_line() + 
  labs(title = my_title,
       x = "Day of the year",
       y = "Number of requests",
       caption = my_caption) +
  theme_bw(base_size = 18)

That plotted a line for each year, but there is no way to tell which line corresponds with which year

Color the lines by the year

df_day_of_year %>% 
  ggplot(aes(yday, n, color = as.factor(year))) + #color the lines by year. as.factor() turns the numeric year column into a factor (a categorical variable)
  geom_line() + 
  labs(title = my_title,
       x = "Day of the year",
       y = "Number of requests",
       caption = my_caption) +
  theme_bw(base_size = 18)

Graph the cumulative sum of pothole requests per year

(df %>% 
  select(request_type, date) %>% 
  filter(request_type == "Potholes") %>% 
  mutate(year = year(date),
         yday = yday(date)) %>% 
  arrange(date) %>% #always arrange your data for cumulative sums
  group_by(year, yday) %>%
  summarize(n = n()) %>% 
  ungroup() %>% 
  group_by(year) %>% 
  mutate(cumsum = cumsum(n)) -> df_cumulative_sum) #calculate the cumulative sum per year
## # A tibble: 983 x 4
## # Groups:   year [4]
##     year  yday     n cumsum
##    <dbl> <dbl> <int>  <int>
##  1  2015   110   119    119
##  2  2015   111   101    220
##  3  2015   112   109    329
##  4  2015   113   102    431
##  5  2015   114    84    515
##  6  2015   117    85    600
##  7  2015   118   101    701
##  8  2015   119   107    808
##  9  2015   120    83    891
## 10  2015   121    66    957
## # ... with 973 more rows
df_cumulative_sum %>% 
  ggplot(aes(yday, cumsum, color = as.factor(year))) +
  geom_line(size = 2) +
  labs(title = my_title,
       x = "Day of the year",
       y = "Cumulative sum of requests",
       caption = my_caption) +
  scale_color_discrete("Year") +
  theme_bw(base_size = 18)
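The reason for arranging and grouping before cumsum() is easier to see on a few rows. A sketch with made-up counts that are deliberately out of order:

```r
library(dplyr)

toy <- tibble(year = c(2016, 2015, 2015, 2016, 2015),
              yday = c(2, 3, 1, 1, 2),
              n    = c(5, 30, 10, 7, 20))   #made-up counts, deliberately out of order

running <- toy %>%
  arrange(year, yday) %>%          #sort first so the running total follows the calendar
  group_by(year) %>%               #restart the total for each year
  mutate(cumsum = cumsum(n)) %>%   #running total within each year
  ungroup()
```

Without arrange() the totals would follow the order of the rows, and without group_by() the 2016 total would continue from where 2015 ended.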

Making an area chart in R

Since 2015 and 2018 have incomplete data, filter them out

df %>% 
  filter(date >= "2016-01-01",
         date < "2018-01-01") -> df_filtered #keep 2016 and 2017 only
df_filtered %>% 
  count(request_type, sort = TRUE) %>% 
  top_n(5) %>% #select the top 5 request types
  ungroup() -> df_top_requests
df_filtered %>% 
  semi_join(df_top_requests) %>% #semi_join keeps only the rows whose request_type appears in df_top_requests
  count(request_type, month) %>% 
  ggplot(aes(month, n, group = request_type, fill = request_type)) +
  geom_area() +
  scale_fill_discrete("Request type") + #change the name of the color legend
  scale_y_continuous(expand = c(0, 0)) + #remove the padding around the edges
  scale_x_discrete(expand = c(0, 0)) +
  labs(title = "Top 5 types of 311 requests in Pittsburgh",
       subtitle = "2016 to 2017",
       x = "",
       y = "Number of requests",
       caption = my_caption) +
  theme_bw(base_size = 18) +
  theme(panel.grid = element_blank()) #remove the gridlines from the plot
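semi_join() is a filtering join: it keeps rows from the first dataframe that have a match in the second, and adds no columns. A sketch with made-up rows; anti_join(), its opposite, is not used in the workshop code and is shown only for contrast:

```r
library(dplyr)

requests <- tibble(request_type = c("Potholes", "Weeds", "Graffiti", "Potholes"),
                   ward = c(5, 14, 12, 13))                       #made-up rows
keep_these <- tibble(request_type = c("Potholes", "Graffiti"))    #the types we want to keep

kept <- requests %>%
  semi_join(keep_these, by = "request_type")   #rows with a match. no columns are added

dropped <- requests %>%
  anti_join(keep_these, by = "request_type")   #rows without a match
```

Naming the column in the by argument avoids the message that dplyr prints when it has to guess which columns to join on.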

Mapping in R

Load the ggmap package, which works with ggplot2

library(ggmap)

Select the request_type, x, and y columns. x and y are longitude and latitude

(df %>% 
  select(request_type, x, y) %>% 
  filter(!is.na(x), !is.na(y),
         request_type == "Potholes") -> df_map) #remove missing x and y values
## # A tibble: 31,735 x 3
##    request_type     x     y
##    <chr>        <dbl> <dbl>
##  1 Potholes     -79.9  40.5
##  2 Potholes     -79.9  40.5
##  3 Potholes     -79.9  40.5
##  4 Potholes     -79.9  40.5
##  5 Potholes     -79.9  40.5
##  6 Potholes     -80.0  40.4
##  7 Potholes     -79.9  40.5
##  8 Potholes     -79.9  40.5
##  9 Potholes     -79.9  40.5
## 10 Potholes     -79.9  40.5
## # ... with 31,725 more rows
city_map <-  get_map("North Oakland, Pittsburgh, PA", 
                     zoom = 12,
                     maptype = "toner", 
                     source = "stamen")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=North+Oakland,+Pittsburgh,+PA&zoom=12&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=North%20Oakland,%20Pittsburgh,%20PA&sensor=false
## Map from URL : http://tile.stamen.com/toner/12/1137/1542.png
## Map from URL : http://tile.stamen.com/toner/12/1138/1542.png
## Map from URL : http://tile.stamen.com/toner/12/1139/1542.png
## Map from URL : http://tile.stamen.com/toner/12/1137/1543.png
## Map from URL : http://tile.stamen.com/toner/12/1138/1543.png
## Map from URL : http://tile.stamen.com/toner/12/1139/1543.png
## Map from URL : http://tile.stamen.com/toner/12/1137/1544.png
## Map from URL : http://tile.stamen.com/toner/12/1138/1544.png
## Map from URL : http://tile.stamen.com/toner/12/1139/1544.png
## Map from URL : http://tile.stamen.com/toner/12/1137/1545.png
## Map from URL : http://tile.stamen.com/toner/12/1138/1545.png
## Map from URL : http://tile.stamen.com/toner/12/1139/1545.png
(city_map <- ggmap(city_map))

Put the data on the map

city_map +
  geom_point(data = df_map, aes(x, y, color = request_type)) #graph the data with dots
## Warning: Removed 729 rows containing missing values (geom_point).

There is too much data on the graph. Make the dots more transparent to show density

city_map +
  geom_point(data = df_map, aes(x, y, color = request_type), alpha = .1) #alpha = .1 makes each dot 90% transparent
## Warning: Removed 729 rows containing missing values (geom_point).

Still not great. Density plots are better for showing overplotted data

#Put the data on the map
city_map +
  stat_density_2d(data = df_map, #Using a 2d density contour
                  aes(x, #longitude
                      y, #latitude,
                      fill = request_type,
                      alpha = ..level..), #Use alpha so you can see the map under the data
                  geom = "polygon") + #We want the contour in a polygon
  scale_alpha_continuous(range = c(.1, 1)) + #manually set the range for the alpha
  guides(alpha = guide_legend("Number of requests"),
         fill = FALSE) +
  labs(title = "Pothole requests in Pittsburgh",
       subtitle = "311 data",
       x = "",
       y = "",
       caption = my_caption) +
  theme_bw(base_size = 18) +
  theme(axis.text = element_blank())
## Warning: Removed 729 rows containing non-finite values (stat_density2d).